I am through with
XFS, once and for all. Well, at least for laptops.
I still think it's a good filesystem when you can ensure that the power never
goes out and your hardware is reliable, but it's just not adequate for laptops
or even desktops.
I ran into
some serious problems a while ago, but managed to recover. Two
nights ago, however, three
XFS filesystems on my laptop decided to blow up
and left my system thoroughly broken. I guess as the
hibernate maintainer,
I should really start doing my tests somewhere other than my main system...
It all started out with a
dist-upgrade and this output:
dpkg: error processing /var/cache/apt/archives/dpkg_1.13.22_i386.deb
(--unpack): unable to make backup link of
'./usr/share/man/man1/dpkg-deb.1.gz' before installing new version: Unknown
error 990
Looking at
/usr/share/man/man1, I started to anticipate the apocalypse:
# ls -l /usr/share/man/man1
total 7956
?????????? ? ? ? ? ? ? 7zr.1.gz
?????????? ? ? ? ? ? ? 822-date.1.gz
?????????? ? ? ? ? ? ? CA.pl.1ssl.gz
?????????? ? ? ? ? ? ? Defoma::Common.1.gz
So I look at the log, and amidst kernel oops notices, there's this lovely
cookie:
Filesystem "hda6": Corruption of in-memory data detected. Shutting down
filesystem: hda6
Filesystem
hda6 is
/usr, so at that time I figured "it could have been
worse", booted to single user, and remade the filesystem with the intention to
simply reinstall all packages... when I found
/var/lib/dpkg/info to be
in similar condition. The rest of /var
seemed fine, but I resolved then that
there was no hope in reviving this system.
Fortunately I brought an external drive that had just enough free space to
hold my
/home and some other stuff, but since USB is
really slow when it
comes to shifting large amounts of data, I decided to do something productive
in the meantime and answer some outstanding mails. It wasn't difficult to
get SSH back up, so I started to work on a remote machine and used the time
efficiently.
Some time later, though, I got confused in the midst of
screen sessions and
was browsing my home directory on the laptop, thinking I was elsewhere (my
home directories are mostly synchronised), when I noticed a directory in
similar condition as the above. Oh shit. Imagine my pain and fear as I first
thought my remote machine was also dying, imagine the sigh when I found out
I was on the local filesystem, and imagine the shock when I realised that
/home was
also affected by the
XFS breakage...
A glance around
/var confirmed that the
XFS breakage was actually
spreading and had now affected three filesystems on this machine. Fortunately,
by that time, I had copied everything to the external drive, and decided to
put my laptop and myself to sleep.
I woke the next morning to the task of reinstalling the thing and decided to
be optimistic about it. After all, a reinstall would mean I could finally try
partman-crypto and encrypt my laptop's data to protect against leaking
sensitive stuff in the case of loss or theft of my laptop.
The installation was not as painless as I had hoped, but that was mainly
because I ran into a known problem with the graphical installer and
partman-crypto, which does not allow one to set up volumes with random
encryption keys (e.g. swap; see the forthcoming announcement for the beta3
release of the installer), and a bunch of smaller bugs. I had to restart the
installation with the traditional frontend to get what I wanted, but other
than that,
I was very impressed with what our installer development team
has accomplished! And a special round of gratitude to Frans Pop for not losing
his patience while helping me on several occasions throughout the process.
Now, 24 hours after the incident, I am back to normal with a fresh laptop and
no data lost (except for one directory which I pulled from a mirrored remote
machine; it had no local modifications (so why did
XFS screw it up
anyway?)). The fonts are all jaggy, so there's something I have to figure out.
All things considered, I am sad to have lost 24 hours, but I can also relax
more now, without fear of further
XFS breakage or loss of private data.
Update: Oh, and despite
this, I did choose
ext3 for all my laptop's
filesystems. JFS was really cumbersome and slow last time I tried it, and
I surely would not touch RazorFS after experiencing serious data loss on
numerous occasions.
Update:
Two responses so far. Full ack for
Julien (except for him
laughing at me).
Ingo's post warrants a reply, though.
First,
ext3 is also journaled, and if you're about to say "yeah, but it's
a hack on top of
ext2", well...
ext2 is damn mature, and journaling
isn't really rocket science, so that "hack" isn't going to be too complicated.
In fact, I
like the idea of journaling being an option rather than
a built-in feature.
Second, of course you're supposed to keep backups. But since you keep backups,
my top requirement of a filesystem is not "how to get the data back", but "how
to ensure it does not break". If it breaks, I can reinstall and restore from
backup, but that's a certain amount of time lost. If it doesn't break, well,
that's like
stealing a little something back from death then, isn't it?
Third, I
do follow the
linux-xfs mailing list, but so what? I did
not
have write cache enabled, and I was running the 2.6.17.7 kernel at the time of
the mishap.
Lastly, you point to "excellent tools" to recover the filesystem. I am not
sure how excellent
xfs_repair really is when it reports "
bad magic number
0x0 on dir inode 4696727" during the run and claims to have fixed it, yet when
I mount the filesystem, unmount it, and run
xfs_repair again, I get the
same
message.
No filesystem is perfect, and as we know from
Biella's problems (among many
others),
ext3 is no exception. But we
did get her data off! So then it's
really an open field again, crap filesystem against crap filesystem. I guess
at this point it helps to know that
ext3 actually follows VFS semantics,
while on
XFS, a completed
sync() syscall does not actually mean it has
written the data to disk (see e.g.
#317479). And then there are bugs like
#239111...
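(For completeness: if an application truly needs its data on the platters, it has to ask for that explicitly with fsync(); neither a returned write() nor, apparently, a returned sync() is a durability guarantee on every filesystem. Here is a minimal sketch in Python of what that dance looks like; the helper name and file path are mine and purely illustrative.)

```python
import os

def durable_write(path, data):
    """Write data to path and force it out to stable storage.

    On POSIX, write() only hands the data to the kernel's page cache;
    it is fsync() that requests the data actually reach the disk
    (whether the drive's own write cache honours that is yet another
    matter).
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)  # block until the kernel has flushed the file data
    finally:
        os.close(fd)
    # The new directory entry needs an fsync() on the directory, too:
    dfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)

durable_write("/tmp/fsync-demo.txt", b"hello, disk\n")
```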
ext3 it is for now. If that lets me down, I'll try JFS. If that fails and
no one has actually implemented a proper filesystem, I might have a go myself.
Haha.
Update: Alceste Scalas adds:
Ingo is right when he says that every filesystem has bugs --- but bugs
apart, the design of Ext3 (i.e. its physical-block journaling) makes it
a far more reliable choice for desktop and laptop PCs, especially for
people without a UPS. An Ext3 filesystem could only crash because of a bug
or a hardware failure, while an XFS filesystem can be trashed even
without bugs or hardware failures, due to the unavoidable consequences of
a power loss on PC-class hardware.
He also alerted me to
this mailing list post, which compares
data=ordered journaling of
ext3 (which almost no one uses for
performance reasons) with
XFS and RazorFS.
Update: You may also be interested in
this post.
Update: Otavio Salvador points me to
this FAQ entry, which SGI must
have added only recently. It explains how to deal with the directory corruption
that was part of my problem. I guess I would have liked to know earlier, but
I consider the outcome with
dm-crypt +
ext3 a win anyhow.
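(In case anyone is curious what such a setup boils down to: roughly a couple of entries like the following. The device and mapping names here are made up for illustration and are not my actual layout; the /dev/urandom line is the randomly-keyed swap that the graphical installer could not set up.)

```
# /etc/crypttab -- hypothetical example
# <target>  <source>    <key file>     <options>
chome       /dev/hda7   none           luks
cswap       /dev/hda5   /dev/urandom   swap,cipher=aes-cbc-essiv:sha256

# /etc/fstab
/dev/mapper/chome   /home   ext3   defaults   0   2
/dev/mapper/cswap   none    swap   sw         0   0
```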